Supplementary materials for ‘TRADITIONAL AND DIGITAL TYPOLOGIES COMPARED: THE EXAMPLE OF ITALIAN PROTOHISTORY’

Author
Lorenzo Cardarelli

Affiliation
Department of Ricerca e Innovazione Umanistica, University of Bari Aldo Moro | Associate researcher, CNR - Istituto di Scienze del Patrimonio Culturale, Area della Ricerca Roma 1

Abstract

The study of material culture, particularly ceramics, is an important aspect of archaeology. The classification and typology of ceramics is often a key focus, with traditional methods based on the observation of specific morphological characteristics and the creation of taxonomies based on these types. However, recent advances in digital technologies and the use of Machine Learning have led to the development of new, digital approaches to ceramic classification. These approaches rely on statistical methods and computer science techniques to quantify vessel morphology and create multivariate datasets. The aim of this study is to compare traditional and a proposed digital method of ceramic classification, using a database of approximately 1400 records from several Italian Bronze and Iron Age funerary contexts (Casinalbo and other contexts in the Po Valley and northern Italy, Pianello di Genga and Torre Galli). The proposed approach involves a combination of a non-linear dimensionality reduction algorithm and density-based clustering. The study will consider the similarities and differences between the two methods and the potential for combining them in a complementary manner. The results of the study will provide a deeper understanding of traditional typologies and the potential for digital methods to enhance and improve upon them.

Importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from aippa.cluster import HierarchicalDBSCAN
from aippa.DimensionReduction import Umap
from sklearn.preprocessing import MinMaxScaler
from scipy import sparse
import math
from scipy import stats
from sklearn import metrics
#pd.set_option('display.max_rows', None)

Define custom functions

Code
def plot_digital_types(data, show_id = False, info_selected = [], savefig = False, 
                        hspace = 0.5, wspace = 0.5, fontsize = 10, pot_title = "ids", sub_title = [], set_suptitle = [], total_info = []):
    # Plot every binary matrix belonging to one digital type, five per row,
    # optionally titling each panel with its id and its traditional type
    fig = plt.figure(figsize=(10, len(data)*0.75))
    fig.subplots_adjust(hspace=hspace, wspace=wspace)
    for i, axi in enumerate(data):
        ax = fig.add_subplot(math.ceil(len(data)/5), 5, i+1)
        if show_id:
            ax.set_title(f"{info_selected[pot_title].values[i]} \n Traditional type: {info_selected[sub_title].values[i]}", fontsize=fontsize)
        if sub_title:
            ax.set_xlabel(info_selected[sub_title].values[i], fontsize=fontsize)
        ax.imshow(data[i].reshape(256, 256), cmap='viridis')
        ax.axis('off')
    print(f"Digital type: {info_selected.AI_type.values[0]}")
    print()
    plt.show()
Code
def LoadPots(path):
    # Load the sparse .npz archive and restore the stack of 256 x 256 binary matrices
    pots = sparse.load_npz(path).toarray().reshape(-1, 256, 256)
    return pots
Code
def plot_pots(data, rows = 10, cols = 10, figsize = (10,10), show_id = False, info_selected = [], savefig = False, 
              hspace = 0.5, wspace = 0.5, fontsize = 10, pot_title = "ids", sub_title = [], set_suptitle = []):
    # Plot a rows x cols grid of binary matrices, with optional ids and type labels
    fig, ax = plt.subplots(rows, cols, figsize=figsize,
                           subplot_kw=dict(xticks=[], yticks=[]))
    fig.subplots_adjust(hspace=hspace, wspace=wspace)
    for i, axi in enumerate(ax.flat):
        im = axi.imshow(data[i].reshape(256, 256), cmap='viridis')
    if show_id:
        for i, axi in enumerate(ax.flat):
            axi.set_title(info_selected[pot_title].values[i], fontsize=fontsize)
    if sub_title:
        for i, axi in enumerate(ax.flat):
            axi.set_xlabel(info_selected[sub_title].values[i], fontsize=fontsize)
    if set_suptitle:
        for i, axi in enumerate(ax.flat):
            if i == 0:
                axi.set_ylabel(f"Digital type: {set_suptitle}", fontsize=10)       
    if savefig:
        plt.savefig('pots.jpg', dpi=1200, bbox_inches='tight')
    plt.show()

Loading data

Loading binary matrices from the .npz file

pots = LoadPots("images_sparse.npz")
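The .npz archive is a sparse matrix with one flattened image per row. A minimal sketch of the save/load round trip, using dummy data and an in-memory buffer instead of the real file (the packing step is an assumption about how images_sparse.npz was built, not shown in the original):

```python
import io
import numpy as np
from scipy import sparse

imgs = np.zeros((3, 256, 256), dtype=np.uint8)   # three dummy binary matrices
imgs[:, 100:150, 100:150] = 1                    # a filled square as a stand-in profile

buf = io.BytesIO()
sparse.save_npz(buf, sparse.csr_matrix(imgs.reshape(len(imgs), -1)))  # one row per image
buf.seek(0)

restored = sparse.load_npz(buf).toarray().reshape(-1, 256, 256)       # as in LoadPots
assert (restored == imgs).all()
```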

Display a portion of the dataset

plot_pots(pots, rows = 5, cols = 5, figsize = (10,10), show_id = False, info_selected = [], savefig = False, hspace = 0.05, wspace = 0.05)

Load the tabular data stored in a .xlsx file

total_df = pd.read_excel("tabular_data.xlsx")

Let’s visualize a portion of the dataset:

  • ids: the unique identifier of the record

  • typology: the traditional typology’s name and bibliography

  • context: the archaeological context where the record was found

  • type: the traditional type’s name

  • Class: a more general classification of the type developed by the author

  • grave: the grave number

  • pag; fig/tav; n: the bibliographic reference of the record

  • Taxonomic_index: the rank of the type in the traditional typology

total_df
ids typology context type Class grave pag fig/tav n Taxonomic_index
0 CSLN_0001 Cardarelli 2014 Casinalbo N1 Ciotole 12 118 85 5 127
1 CSLN_0002 Cardarelli 2014 Casinalbo E13 Vasi a collo distinto 10 118 85 6 75
2 CSLN_0003 Cardarelli 2014 Casinalbo B6 Vasi a profilo sinuoso 14 118 85 9 33
3 CSLN_0004 Cardarelli 2014 Casinalbo C20 Vasi a profilo biconico 15 119 86 1 55
4 CSLN_0005 Cardarelli 2014 Casinalbo I2 Tazze attingitoio 16 119 86 2 111
... ... ... ... ... ... ... ... ... ... ...
1355 PAD_0449 Cardarelli 2014 Bovolone O14 Scodelle 67 543 342 t.67 151
1356 PAD_0450 Cardarelli 2014 Bovolone O15 Scodelle 8 543 342 t.8 152
1357 PAD_0451 Cardarelli 2014 Olmo di Nogara P4 Boccali e bicchieri 1 544 343 t.434 157
1358 PAD_0452 Cardarelli 2014 Montata P4 Boccali e bicchieri 140 544 343 t.140 157
1359 PAD_0453 Cardarelli 2014 Bovolone P4 Boccali e bicchieri 43 544 343 t.43 157

1360 rows × 10 columns

As stated in the text, I remove the Unica (pieces without parallels) and all other types attested only once from the dataset. From a starting count of 1360 records, I end up with 1149.

remove_pipeline = []
for context in total_df["context"].unique():
    selected_context = total_df.loc[total_df["context"] == context]
    major_two = selected_context.type.value_counts() > 1
    selected = selected_context.loc[selected_context.type.isin(major_two[major_two].index)]
    remove_pipeline.append(selected)

selected = pd.concat(remove_pipeline)
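On a toy table, the per-context filter above is equivalent to a pandas groupby/filter; this is only a cross-check of the rule, not the code actually run on the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "context": ["A", "A", "A", "B", "B"],
    "type":    ["t1", "t1", "t2", "t1", "t3"],
})

# drop every (context, type) group attested only once
kept = toy.groupby(["context", "type"]).filter(lambda g: len(g) > 1)

assert kept["type"].tolist() == ["t1", "t1"]   # only the pair in context A survives
```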

We can view the distribution of the records in each typology:

selected.groupby(["typology"]).count()[["ids"]]
ids
typology
Bianco Peroni et al. 2010 188
Cardarelli 2014 533
Pacciarelli 1999 428

and the number of types in each typology:

types_number = []
for context in selected["typology"].unique():
    selected_context = selected.loc[selected["typology"] == context]
    types_number.append((context, selected_context.type.unique().shape[0]))

pd.DataFrame(types_number, columns = ["typology", "types_number"]).set_index("typology")
types_number
typology
Cardarelli 2014 106
Bianco Peroni et al. 2010 44
Pacciarelli 1999 55

As stated in the paper, I define a new variable, the weighted taxonomic index: the dense rank, within each typology, of the traditional types retained after filtering.

taxonomic_index_pipeline = []
for context in selected["typology"].unique():
    selected_context = selected.loc[selected["typology"] == context]
    # densely re-rank the surviving Taxonomic_index values within each typology
    for i, ti in enumerate(selected_context.Taxonomic_index.sort_values().unique()):
        new_taxed_df = selected_context.loc[selected_context.Taxonomic_index == ti].copy()
        new_taxed_df["weighted_taxonomic_index"] = i + 1
        taxonomic_index_pipeline.append(new_taxed_df)

selected = pd.concat(taxonomic_index_pipeline)
selected
ids typology context type Class grave pag fig/tav n Taxonomic_index weighted_taxonomic_index
908 PAD_0002 Cardarelli 2014 Copezzato A5 Vasi a profilo continuo NaN 472 271 37985b 5 1
909 PAD_0003 Cardarelli 2014 Copezzato A5 Vasi a profilo continuo NaN 472 271 41380b 5 1
910 PAD_0004 Cardarelli 2014 Copezzato A5 Vasi a profilo continuo NaN 472 271 56163 5 1
911 PAD_0005 Cardarelli 2014 Copezzato A5 Vasi a profilo continuo NaN 472 271 56165 5 1
914 PAD_0008 Cardarelli 2014 Capriano del Colle A5 Vasi a profilo continuo A 472 271 T.A 5 1
... ... ... ... ... ... ... ... ... ... ... ...
634 TRRGLL_0179 Pacciarelli 1999 Torre Galli K2 Pissidi 110 303 77 4 66 54
651 TRRGLL_0199 Pacciarelli 1999 Torre Galli K2 Pissidi 119 310 84A 2 66 54
757 TRRGLL_0319 Pacciarelli 1999 Torre Galli K2 Pissidi 183 349 123B 1 66 54
678 TRRGLL_0228 Pacciarelli 1999 Torre Galli K4 Pissidi 133 320 94 4 68 55
814 TRRGLL_0378 Pacciarelli 1999 Torre Galli K4 Pissidi 217 373 147A 4 68 55

1149 rows × 11 columns
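The double loop above amounts to a dense rank of Taxonomic_index within each typology; a minimal cross-check with pandas' own rank method:

```python
import pandas as pd

toy = pd.DataFrame({
    "typology": ["X", "X", "X", "X"],
    "Taxonomic_index": [5, 5, 12, 40],
})
toy["weighted_taxonomic_index"] = (
    toy.groupby("typology")["Taxonomic_index"]
       .rank(method="dense")      # ties share a rank, no gaps between ranks
       .astype(int)
)
assert toy["weighted_taxonomic_index"].tolist() == [1, 1, 2, 3]
```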

We can select the binary matrices using this filtered tabular data

selected_pots = pots[selected.index]
selected.reset_index(inplace=True)
TRANSFORMATION_SELECTED = selected_pots 

Here we can check the shape of the binary matrices: 1149 records, each 256 x 256 pixels

TRANSFORMATION_SELECTED.shape
(1149, 256, 256)

Define variability in “traditional” types

selected_type = selected.loc[selected.type == "Aa1"]
mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)

Visualize the result

fig = plt.imshow(mean_pot, cmap = "viridis")
# title
plt.title(selected_type["type"].values[0])
# save
plt.savefig("mean_pot.jpg", dpi = 600)

Perform the same analysis for all the “traditional” types

pipeline_pots = []
mean_pots_img_traditional_type = []


for context in selected.typology.unique():

    selected_context = selected.loc[selected.typology == context]
    for type_ in selected_context.type.unique():
        selected_type = selected_context.loc[selected_context.type == type_]

        mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)

        mean_pot_value = np.mean(mean_pot.ravel()[mean_pot.ravel() > 0.0])



        mean_pots_img_traditional_type.append(mean_pot)
        pipeline_pots.append([context, type_, mean_pot_value, selected_type.Class.values[0], len(selected_type)])


pipeline_pots_traditional_type = pd.DataFrame(pipeline_pots, columns = ["context", "type", "mean_pot_value", "classe", "n_pots"])
pipeline_pots_traditional_type_context = pipeline_pots_traditional_type.loc[pipeline_pots_traditional_type.context == "Bianco Peroni et al. 2010"]
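The morphological homogeneity index (mean_pot_value) is the mean of the non-zero pixels of the per-type average image: perfectly overlapping silhouettes score 1.0, divergent ones score lower. A toy check of that behavior:

```python
import numpy as np

def homogeneity(stack):
    # mean image of the stack, then mean over its non-zero pixels
    mean_img = np.mean(stack, axis=0)
    return np.mean(mean_img.ravel()[mean_img.ravel() > 0.0])

a = np.zeros((4, 4)); a[:2, :] = 1   # silhouette filling the top half
b = a.copy()                         # identical silhouette
c = np.zeros((4, 4)); c[2:, :] = 1   # disjoint silhouette (bottom half)

assert homogeneity(np.stack([a, b])) == 1.0   # total overlap
assert homogeneity(np.stack([a, c])) == 0.5   # no overlap
```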

Visualize some traditional types’ heatmaps

plot_pots(np.array(mean_pots_img_traditional_type)[pipeline_pots_traditional_type_context.index.values], 4,4, pot_title="type", show_id=True, info_selected=pipeline_pots_traditional_type_context, savefig=False, figsize=(10, 10), hspace=0.5, wspace=0.05)

Visualize the relationship between the morphological homogeneity index of the types and the number of records in each type

sns.scatterplot(data = pipeline_pots_traditional_type, x = "mean_pot_value", y = "n_pots", hue = "context",)
plt.ylabel("Pots number for each type")
plt.xlabel("Mean pot value")
plt.savefig("scatter.jpg", dpi = 600)

Define correlation between the morphological homogeneity index and the number of pots for each type

corr, p_value = stats.spearmanr(pipeline_pots_traditional_type["mean_pot_value"], pipeline_pots_traditional_type["n_pots"])
print(f"Correlation: {corr}, p_value: {p_value}")
Correlation: -0.8165483512594836, p_value: 2.486759269785813e-50

We obtain a Spearman correlation of about -0.82, significant at the 0.05 level: types with more records tend to be less morphologically homogeneous.
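To read the statistic: a Spearman ρ near -1 indicates an almost perfectly monotone decreasing relation between type size and homogeneity, as in this toy series (invented numbers, for illustration only):

```python
import numpy as np
from scipy import stats

n_pots = np.array([2, 3, 5, 8, 13])
homog  = np.array([0.90, 0.80, 0.85, 0.50, 0.40])   # mostly decreasing with size

corr, p = stats.spearmanr(homog, n_pots)
assert abs(corr + 0.9) < 1e-9   # strong negative rank correlation
assert p < 0.05
```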

In the table, the median values of the morphological homogeneity index and the median number of pots for each type are reported.

pipeline_pots_traditional_type.groupby(["context"]).median()
mean_pot_value n_pots
context
Bianco Peroni et al. 2010 0.577298 3.0
Cardarelli 2014 0.589438 3.5
Pacciarelli 1999 0.483486 5.0
sns.boxplot(data = pipeline_pots_traditional_type, y = "mean_pot_value", x = "context")
sns.stripplot(data = pipeline_pots_traditional_type, y = "mean_pot_value", x = "context", color = "black", alpha = 0.5)
plt.ylabel("Mean pot value")
plt.xlabel("Context")
plt.savefig("mean_pot_value_boxplot.jpg", dpi = 600)

The same calculation, using a fixed number of records (random samples of 5) for each type

pipeline_pots = []
mean_pots_img = []

for context in selected.typology.unique():
    selected_context = selected.loc[selected.typology == context]
    for n in range(25):
        for type_ in selected_context.type.unique():
            selected_type = selected_context.loc[selected_context.type == type_]

            if len(selected_type) > 5:

                selected_type = selected_type.sample(5)

                mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)

                mean_pot_value = np.mean(mean_pot.ravel()[mean_pot.ravel() > 0.0])

                mean_pots_img.append(mean_pot)
                pipeline_pots.append([context, type_, mean_pot_value, selected_type.Class.values[0], len(selected_type)])


pipeline_pots_same_sample = pd.DataFrame(pipeline_pots, columns = ["context", "type", "mean_pot_value", "classe", "n_pots"])
pipeline_pots_same_sample["pipeline"] = "same_sample"
pipeline_pots_same_sample.groupby(["context"]).median()
mean_pot_value n_pots
context
Bianco Peroni et al. 2010 0.462855 5.0
Cardarelli 2014 0.562172 5.0
Pacciarelli 1999 0.464572 5.0
sns.boxplot(data = pipeline_pots_same_sample, y = "mean_pot_value", x = "context",)

plt.ylabel("Mean pot value")
plt.xlabel("Context")
plt.savefig("mean_pot_value_boxplot.jpg", dpi = 600)

Finally, we repeat the analysis using a random sample within each Class

pipeline_pots = []
mean_pots_img = []

for context in selected.typology.unique():
    selected_context = selected.loc[selected.typology == context]
    for n in range(200):
        random_class = np.random.choice(selected_context.Class.unique())

        if len(selected_context.loc[selected_context.Class == random_class]) > 5:
            selected_type = selected_context.loc[selected_context.Class.isin([random_class])].sample(5)

            mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)

            mean_pot_value = np.mean(mean_pot.ravel()[mean_pot.ravel() > 0.0])

            mean_pots_img.append(mean_pot)
            pipeline_pots.append([context, mean_pot_value, selected_type.Class.values[0], len(selected_type)])
        else:
            pass

    




pipeline_pots_random = pd.DataFrame(pipeline_pots, columns = ["context", "mean_pot_value", "classe", "n_pots"])
pipeline_pots_random["pipeline"] = "random"
pipeline_pots_random.groupby(["context"]).median()
mean_pot_value n_pots
context
Bianco Peroni et al. 2010 0.419328 5.0
Cardarelli 2014 0.472517 5.0
Pacciarelli 1999 0.428599 5.0
sns.boxplot(data = pipeline_pots_random, y = "mean_pot_value", x = "context",)

pipeline_pots_stats = pd.concat([pipeline_pots_same_sample, pipeline_pots_random])
pipeline_pots_stats
context type mean_pot_value classe n_pots pipeline
0 Cardarelli 2014 A5 0.571379 Vasi a profilo continuo 5 same_sample
1 Cardarelli 2014 A6 0.576809 Vasi a profilo continuo 5 same_sample
2 Cardarelli 2014 A11 0.549884 Vasi a profilo continuo 5 same_sample
3 Cardarelli 2014 A13 0.459088 Vasi a profilo continuo 5 same_sample
4 Cardarelli 2014 A21 0.537647 Vasi a profilo continuo 5 same_sample
... ... ... ... ... ... ...
452 Pacciarelli 1999 NaN 0.461267 Anforette 5 random
453 Pacciarelli 1999 NaN 0.362806 Dolii e olle 5 random
454 Pacciarelli 1999 NaN 0.431780 Anforette 5 random
455 Pacciarelli 1999 NaN 0.433386 Anfore 5 random
456 Pacciarelli 1999 NaN 0.435718 Anforette 5 random

2082 rows × 6 columns

Finally, we compare the results obtained using real types with those obtained using random groupings within each Class

sns.boxplot(data = pipeline_pots_stats, y = "mean_pot_value", x = "context",hue = "pipeline")
plt.legend(loc='upper right')
plt.ylabel("Mean pot value")
plt.xlabel("Context")
plt.savefig("mean_pot_value_boxplot.jpg", dpi = 600)

Define noise

In this section we define a procedure for identifying, within the binary matrices, zones of overlap between the various types. These zones are considered ‘noise’ and are removed from the dataset.

median_pots = []
mean_pots_sup = []

for context in selected.typology.unique():
    selected_context = selected.loc[selected.typology == context]
    for type_ in selected_context.type.unique():
        selected_type = selected_context.loc[selected_context.type == type_]

        if len(selected_type) >= 5:
            median_pot = np.median(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)
            mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)


            median_pots.append(median_pot)
            mean_pots_sup.append(mean_pot)
fig, ax = plt.subplots(1, 3, figsize = (10, 5))
ax[0].imshow(mean_pots_sup[25])
ax[0].set_title("Mean")
ax[1].imshow(median_pots[25])
ax[1].set_title("Median")
ax[2].imshow(np.median(median_pots, axis = 0))
ax[2].set_title("Median of medians");

#plt.savefig("mean_median.jpg", dpi = 600)
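One possible way to turn the “median of medians” map into an explicit noise mask is to threshold it; this is a hypothetical sketch with random stand-in data and an invented 0.5 threshold — the analysis below instead simply crops away the most overlapping rows:

```python
import numpy as np

rng = np.random.default_rng(0)
medians = rng.random((10, 8, 8))            # stand-ins for the per-type median images

overlap = np.median(medians, axis=0)        # "median of medians" overlap map
noise_mask = overlap > 0.5                  # hypothetical threshold on shared zones
denoised = medians * ~noise_mask            # zero out the noisy pixels in every image

assert denoised.shape == medians.shape
assert (denoised[:, noise_mask] == 0).all()
```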

Define new types

In this section we define a procedure for identifying new types within the binary matrices. The first step is to isolate the non-noisy zones within the binary arrays.

cut_data = selected_pots[:, :150, :]

Let’s flatten the binary matrices for the analysis

flatted_pots = []
for pot in cut_data:
    flatted_pots.append(pot.flatten())
flatted_pots = np.array(flatted_pots)

flatted_pots.shape
(1149, 38400)
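The crop keeps the top 150 of 256 pixel rows, and the flattening loop is equivalent to a single reshape; a quick shape check with dummy data:

```python
import numpy as np

pots_demo = np.ones((3, 256, 256))
cut = pots_demo[:, :150, :]            # keep only the top 150 rows of each matrix
flat = cut.reshape(len(cut), -1)       # one vector of 150 * 256 = 38400 pixels per pot

assert cut.shape == (3, 150, 256)
assert flat.shape == (3, 38400)
```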

Type definition and metric calculation are carried out separately for each typology

Cardarelli 2014

pipeline_pots = []

mean_pots_img = []

selected_cont = selected.loc[selected.typology == "Cardarelli 2014"]
ump = Umap(flatted_pots[selected_cont.index], n_components=2, n_neighbors = 3, n_epochs = 500, random_state = 42, metric = "euclidean", min_dist = 0.05)

hdb = HierarchicalDBSCAN(MinMaxScaler().fit_transform(ump.reduction), info_selected = selected_cont, min_cluster_size = 3, cluster_selection_method = "leaf", min_samples = 2, gen_min_span_tree = True,)
selected_pipeline = selected_cont.copy()
selected_pipeline["AI_type"] = hdb.cluster_labels
no_noise = selected_pipeline.loc[selected_pipeline["AI_type"] != -1]

for type_ in no_noise.AI_type.unique():
    selected_type = no_noise.loc[no_noise.AI_type == type_]

    mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)

    mean_pot_value = np.mean(mean_pot.ravel()[mean_pot.ravel() > 0.0])

    if len(selected_type) >= 3:
        std_score = np.std(selected_type.Taxonomic_index.values)
    else:
        std_score = np.nan

    fws = metrics.fowlkes_mallows_score(selected_type.type.values, selected_type.AI_type.values) 
    
    correspondence = selected_type.type.value_counts().index[np.where(selected_type.type.value_counts().values >= selected_type.type.value_counts().sum()*0.50)]
    if len(correspondence.values) > 0:
        correspondence = str(correspondence.values.tolist())
    else:
        correspondence = np.nan

    mean_pots_img.append(mean_pot)
    pipeline_pots.append([type_, mean_pot_value, len(selected_type), std_score, correspondence, fws])

pipeline_pots_cardarelli = pd.DataFrame(pipeline_pots, columns = ["type", "mean_pot_value", "n_pots", "std_score", "correspondence", "fws"])
Using these parameters HDBSCAN finds 104 clusters. 37 noise-points
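HierarchicalDBSCAN comes from the aippa package, which is not documented here; assuming it follows the usual HDBSCAN/DBSCAN convention, the label -1 marks noise points, which is why only rows with AI_type != -1 are kept. Scikit-learn's plain DBSCAN (a stand-in here, not the clusterer actually used) illustrates that convention:

```python
import numpy as np
from sklearn.cluster import DBSCAN

emb = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],   # first dense group
    [1.0, 1.0], [1.0, 0.9], [0.9, 1.0],   # second dense group
    [0.5, 0.5],                           # isolated point
])
labels = DBSCAN(eps=0.2, min_samples=2).fit_predict(emb)

assert sorted(set(labels)) == [-1, 0, 1]  # two clusters plus noise
assert labels[-1] == -1                   # the isolated point is flagged as noise
```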

We use this loop to count the digital types whose similarity to the traditional types, measured by the Fowlkes–Mallows score (fws), equals or exceeds each of a set of thresholds

corrisp = []

quartils = [0.25, 0.5, 0.75, 1]

for quart in quartils:
    corrisp.append((quart, len(pipeline_pots_cardarelli[pipeline_pots_cardarelli.fws >= quart])))
corrisp
[(0.25, 65), (0.5, 31), (0.75, 3), (1, 2)]
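Within one digital cluster the AI label is constant, so the Fowlkes–Mallows score reduces to sqrt(TP / (TP + FP)) over pot pairs; for example, a 3-pot cluster in which two pots share a traditional type scores sqrt(1/3) ≈ 0.577, a value that recurs in the tables (the type names here are illustrative):

```python
from sklearn import metrics

ai   = [0, 0, 0]              # one digital cluster: constant label
trad = ["O6", "O6", "O12"]    # two of three pots share a traditional type

fws = metrics.fowlkes_mallows_score(trad, ai)
assert abs(fws - 3 ** -0.5) < 1e-9           # sqrt(1/3), about 0.577

# a fully homogeneous cluster scores 1.0
assert metrics.fowlkes_mallows_score(["O6"] * 3, ai) == 1.0
```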
pipeline_pots_casinalbo = pipeline_pots_cardarelli.rename(columns = {"type": "AI_type", "correspondence": "correspondence", "n_pots": "N_pots", "mean_pot_value": "Mean_pot_value", "std_score": "Tassonomic_variability"})

Let’s visualise the results!

for digital_type in pipeline_pots_casinalbo.AI_type.values:
    selected_digit_type = no_noise.loc[no_noise.AI_type == digital_type]
    plot_digital_types(data = selected_pots[selected_digit_type.index], info_selected = selected_digit_type, show_id=True, pot_title="ids", sub_title="type", )
[Output omitted: 104 figures, one per digital type, each captioned “Digital type: N” and showing the cluster’s pots with their ids and traditional types]

pipeline_pots_casinalbo.median()
AI_type                   51.500000
Mean_pot_value             0.663603
N_pots                     4.000000
Tassonomic_variability    16.316210
fws                        0.365148
dtype: float64
pipeline_pots_casinalbo
AI_type Mean_pot_value N_pots Tassonomic_variability correspondence fws
0 73 0.499812 10 50.140203 NaN 0.365148
1 68 0.613002 5 24.489998 NaN 0.316228
2 11 0.564243 6 2.426703 ['A6'] 0.516398
3 26 0.468910 8 41.036379 ['I16'] 0.500000
4 63 0.782143 3 2.160247 NaN 0.000000
... ... ... ... ... ... ...
99 19 0.647845 3 2.357023 ['O6'] 0.577350
100 16 0.788712 3 2.449490 NaN 0.000000
101 30 0.746097 3 0.000000 ['O6'] 1.000000
102 46 0.649857 5 1.166190 ['O12'] 0.547723
103 2 0.726586 3 1.414214 ['O12'] 0.577350

104 rows × 6 columns
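The correspondence column records which traditional type, if any, covers at least 50% of a digital cluster; the rule in isolation, on an invented 3-pot cluster:

```python
import pandas as pd

types_in_cluster = pd.Series(["O6", "O6", "O12"])
counts = types_in_cluster.value_counts()

# every traditional type covering at least half of the cluster
majority = counts.index[counts.values >= counts.sum() * 0.50]

assert majority.tolist() == ["O6"]   # "O6" covers 2/3 of the pots
```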

The procedure applied to the Cardarelli 2014 typology is repeated for the other typologies:

Bianco Peroni et al 2010

pipeline_pots = []

mean_pots_img = []

selected_cont = selected.loc[selected.typology == "Bianco Peroni et al. 2010"]
ump = Umap(flatted_pots[selected_cont.index], n_components=2, n_neighbors = 3, n_epochs = 500, random_state = 396, metric = "euclidean", min_dist = 0.00)

hdb = HierarchicalDBSCAN(MinMaxScaler().fit_transform(ump.reduction), info_selected = selected_cont, min_cluster_size = 3, cluster_selection_method = "leaf", min_samples = 2, gen_min_span_tree = True,)
selected_pipeline = selected_cont.copy()
selected_pipeline["AI_type"] = hdb.cluster_labels
no_noise = selected_pipeline.loc[selected_pipeline["AI_type"] != -1]

for type_ in no_noise.AI_type.unique():
    selected_type = no_noise.loc[no_noise.AI_type == type_]

    mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)

    mean_pot_value = np.mean(mean_pot.ravel()[mean_pot.ravel() > 0.0])

    if len(selected_type) >= 3:
        std_score = np.std(selected_type.Taxonomic_index.values)
    else:
        std_score = np.nan 

    fws = metrics.fowlkes_mallows_score(selected_type.type.values, selected_type.AI_type.values) 
    
    correspondence = selected_type.type.value_counts().index[np.where(selected_type.type.value_counts().values >= selected_type.type.value_counts().sum()*0.50)]
    if len(correspondence.values) > 0:
        correspondence = str(correspondence.values.tolist())
    else:
        correspondence = np.nan

    mean_pots_img.append(mean_pot)
    pipeline_pots.append([type_, mean_pot_value, len(selected_type), std_score, correspondence, fws])

pipeline_pots_peroni = pd.DataFrame(pipeline_pots, columns = ["type", "mean_pot_value", "n_pots", "std_score", "correspondence", "fws"])
Using these parameters HDBSCAN finds 37 clusters. 14 noise-points
corrisp = []

quartils = [0.25, 0.5, 0.75, 1]

for quart in quartils:
    corrisp.append((quart, len(pipeline_pots_peroni[pipeline_pots_peroni.fws >= quart])))
corrisp
[(0.25, 29), (0.5, 10), (0.75, 0), (1, 0)]
pipeline_pots_pianello = pipeline_pots_peroni.rename(columns = {"type": "AI_type", "correspondence": "Correspondence", "n_pots": "N_pots", "mean_pot_value": "Mean_pot_value", "std_score": "Tassonomic_variability"})
pipeline_pots_pianello.median()
AI_type                   18.000000
Mean_pot_value             0.580168
N_pots                     4.000000
Tassonomic_variability     3.299832
fws                        0.377964
dtype: float64
for digital_type in pipeline_pots_pianello.AI_type.values:
    selected_digit_type = no_noise.loc[no_noise.AI_type == digital_type]
    plot_digital_types(data = selected_pots[selected_digit_type.index], info_selected = selected_digit_type, show_id=True, pot_title="ids", sub_title="type", )
[Output omitted: 37 figures, one per digital type, each captioned “Digital type: N” and showing the cluster’s pots with their ids and traditional types]

Pacciarelli 1999

pipeline_pots = []

mean_pots_img = []

selected_cont = selected.loc[selected.typology == "Pacciarelli 1999"]
ump = Umap(flatted_pots[selected_cont.index], n_components=2, n_neighbors = 3, n_epochs = 500, random_state = 396, metric = "euclidean", min_dist = 0.00)

hdb = HierarchicalDBSCAN(MinMaxScaler().fit_transform(ump.reduction), info_selected = selected_cont, min_cluster_size = 3, cluster_selection_method = "leaf", min_samples = 2, gen_min_span_tree = True,)
selected_pipeline = selected_cont.copy()
selected_pipeline["AI_type"] = hdb.cluster_labels
no_noise = selected_pipeline.loc[selected_pipeline["AI_type"] != -1]

for type_ in no_noise.AI_type.unique():
    selected_type = no_noise.loc[no_noise.AI_type == type_]

    mean_pot = np.mean(TRANSFORMATION_SELECTED[selected_type.index], axis = 0)

    mean_pot_value = np.mean(mean_pot.ravel()[mean_pot.ravel() > 0.0])

    if len(selected_type) >= 3:
        std_score = np.std(selected_type.Taxonomic_index.values)
    else:
        std_score = np.nan 
    
    fws = metrics.fowlkes_mallows_score(selected_type.type.values, selected_type.AI_type.values) 

    correspondence = selected_type.type.value_counts().index[np.where(selected_type.type.value_counts().values >= selected_type.type.value_counts().sum()*0.50)]
    if len(correspondence.values) > 0:
        correspondence = str(correspondence.values.tolist())
    else:
        correspondence = np.nan

    mean_pots_img.append(mean_pot)
    pipeline_pots.append([type_, mean_pot_value, len(selected_type), std_score, correspondence, fws])

pipeline_pots_pacciarelli = pd.DataFrame(pipeline_pots, columns = ["type", "mean_pot_value", "n_pots", "std_score", "correspondence", "fws"])
Using these parameters HDBSCAN finds 82 clusters. 26 noise-points
corrisp = []

quartils = [0.25, 0.5, 0.75, 1]

for quart in quartils:
    corrisp.append((quart, len(pipeline_pots_pacciarelli[pipeline_pots_pacciarelli.fws >= quart])))
corrisp
[(0.25, 53), (0.5, 18), (0.75, 4), (1, 4)]
pipeline_pots_torregalli = pipeline_pots_pacciarelli.rename(columns = {"type": "AI_type", "correspondence": "Correspondence", "n_pots": "N_pots", "mean_pot_value": "Mean_pot_value", "std_score": "Tassonomic_variability"})
pipeline_pots_torregalli.describe().median()
AI_type                   40.500000
Mean_pot_value             0.591406
N_pots                     4.451220
Tassonomic_variability     4.559172
fws                        0.337657
dtype: float64
pipeline_pots_torregalli
AI_type Mean_pot_value N_pots Tassonomic_variability Correspondence fws
0 49 0.526052 7 5.816935 NaN 0.308607
1 36 0.605945 7 5.194188 NaN 0.308607
2 65 0.511501 4 5.117372 ['Ab8'] 0.408248
3 7 0.489119 7 4.203983 NaN 0.377964
4 32 0.531090 4 6.557439 ['Aa1'] 0.408248
... ... ... ... ... ... ...
77 27 0.492008 6 1.674979 ['H7'] 0.632456
78 60 0.649458 3 1.699673 NaN 0.000000
79 28 0.393832 6 1.154701 ['H7'] 0.447214
80 29 0.645507 3 0.471405 ['H5'] 0.577350
81 70 0.612938 5 0.748331 NaN 0.447214

82 rows × 6 columns

for digital_type in pipeline_pots_torregalli.AI_type.values:
    selected_digit_type = no_noise.loc[no_noise.AI_type == digital_type]
    plot_digital_types(data = selected_pots[selected_digit_type.index], info_selected = selected_digit_type, show_id=True, pot_title="ids", sub_title="type", )
[Output omitted: 82 figures, one per digital type, each captioned “Digital type: N” and showing the cluster’s pots with their ids and traditional types]

Merge the results of the three typologies…

Final considerations

pipeline_pots_same_sample["Typology"] = "traditional"
pipeline_pots_casinalbo["Typology"] = "digital"
pipeline_pots_casinalbo["context"] = "Cardarelli 2014"
pipeline_pots_pianello["Typology"] = "digital"
pipeline_pots_pianello["context"] = "Bianco Peroni et al. 2010"
pipeline_pots_torregalli["Typology"] = "digital"
pipeline_pots_torregalli["context"] = "Pacciarelli 1999"

pipeline_pots_casinalbo_selected = pipeline_pots_casinalbo[["context", "Mean_pot_value", "Typology"]]
pipeline_pots_pianello_selected = pipeline_pots_pianello[["context", "Mean_pot_value", "Typology"]]
pipeline_pots_torregalli_selected = pipeline_pots_torregalli[["context", "Mean_pot_value", "Typology"]]
digital_concat = pd.concat([pipeline_pots_casinalbo_selected, pipeline_pots_pianello_selected, pipeline_pots_torregalli_selected])
digital_concat = digital_concat.rename(columns = {"Mean_pot_value": "mean_pot_value"})
pipeline_pots_random = pipeline_pots_random.rename(columns = {"pipeline": "Typology"})
pipeline_traditional = pipeline_pots_same_sample[["context", "mean_pot_value", "Typology"]]
pipeline_pots_random = pipeline_pots_random[["context", "mean_pot_value", "Typology"]]
concated_ = pd.concat([pipeline_traditional, digital_concat, pipeline_pots_random])
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(x="context", y="mean_pot_value", hue="Typology", data=concated_)
### set x-axis label
ax.set_xlabel("Typology",fontsize=12)
### set y-axis label
ax.set_ylabel("Mean pot value",fontsize=12)

# save as jpg
fig.savefig('mean_pot_value.jpg', format='jpg', dpi=300, bbox_inches='tight')